Abstract

The current dominant paradigm for feature learning in computer vision
relies on training neural networks for the task of object recognition
using millions of hand labelled images. Is it also possible to learn
useful features for a diverse set of visual tasks using any other form
of supervision? In biology, living organisms developed the ability of
visual perception for the purpose of moving and acting in the world.
Drawing inspiration from this observation, in this work we investigate
whether the awareness of egomotion can be used as a supervisory signal
for feature learning. As opposed to the knowledge of class labels,
information about egomotion is freely available to mobile agents. We
show that using the same number of training images, features learnt
using egomotion as supervision compare favourably to features learnt
using class-label as supervision on the tasks of scene recognition,
object recognition, visual odometry and keypoint matching.


“We move in order to see and we see in order to move”

J.J. Gibson

1. Introduction 

Recent advances in computer vision have shown that visual features
learnt by neural networks trained for the task of object recognition
using more than a million labelled images are useful for many computer
vision tasks like semantic segmentation, object detection and action
classification [?]. However, object recognition is one among many tasks
for which vision is used. For example, humans use visual perception for
recognizing objects, understanding spatial layouts of scenes and
performing actions such as moving around in the world. Is there
something special about the task of object recognition or is it the
case that useful visual representations can be learnt through other
modes of supervision? Clearly, biological agents perform complex visual
tasks and it is unlikely that they require external supervision in the
form of millions of labelled examples. Unlabelled visual data is freely
available and in theory this data can be used to learn useful visual
representations. However, until now unsupervised learning approaches
[?, 22, 29, 32] have not yet delivered on their promise and are nowhere
to be seen in current applications on complex real world imagery.

Biological agents use perceptual systems for obtaining sensory
information about their environment that enables them to act and
accomplish their goals [?]. Both biological and robotic agents employ
their motor system for executing actions in their environment. Is it
also possible that these agents can use their own motor system as a
source of supervision for learning useful perceptual representations?
Motor theories of perception have a long history [?], but there has
been little work in formulating computational models of perception that
make use of motor information. In this work we focus on visual
perception and present a model based on egomotion (i.e. self-motion)
for learning useful visual representations. When we say useful visual
representations [?], we mean representations that possess the following
two characteristics: (1) the ability to perform multiple visual tasks
and (2) the ability to perform new visual tasks by learning from only a
few labeled examples provided by an extrinsic teacher.

Mobile agents are naturally aware of their egomotion (i.e. self-motion)
through their own motor system. In other words, knowledge of egomotion
is “freely” available. For example, the vestibular system provides the
sense of orientation in many mammals. In humans and other animals, the
brain has access to information about eye movements and the actions
performed by the animal [?]. A mobile robotic agent can estimate its
egomotion either from the motor commands it issues to move or from
odometry sensors such as gyroscopes and accelerometers mounted on the
agent itself.

We propose that useful visual representations can be learnt by
performing the simple task of correlating visual stimuli with
egomotion. A mobile agent can be treated like a camera moving in the
world and thus the knowledge of egomotion is the same as the knowledge
of camera motion. Using this insight, we pose the problem of
correlating visual stimuli with egomotion as the problem of predicting
the camera transformation from the pairs of images that the agent
receives while it moves. Intuitively, the task of predicting the camera
transformation between two images should force the agent to learn
features that are adept at identifying visual elements that are present
in both images (i.e. visual correspondence). In the past, features such
as SIFT that were hand engineered for finding correspondences were also
found to be very useful for tasks such as object recognition [?]. This
suggests that egomotion based learning can also result in features that
are useful for such tasks.

In order to test our hypothesis of feature learning using egomotion, we
trained multilayer neural networks to predict the camera transformation
between pairs of images. As a proof of concept, we first demonstrate
the usefulness of our approach on the MNIST dataset [?]. We show that
features learnt using our method outperform previous approaches of
unsupervised feature learning when class-label supervision is available
only for a limited number of examples (section 3.4). Next, we evaluated
the efficacy of our approach on real world imagery. For this purpose,
we used image and odometry data recorded from a car moving through
urban scenes, made available as part of the KITTI and the San Francisco
(SF) city [7] datasets. This data mimics the scenario of a robotic
agent moving around in the world. The quality of features learnt from
this data was evaluated on four tasks: (1) scene recognition on SUN
[38] (section 5.1), (2) visual odometry (section 5.4), (3) keypoint
matching (section 5.3) and (4) object recognition on Imagenet [?]
(section 5.2). Our results show that for the same amount of training
data, features learnt using egomotion as supervision compare favorably
to features learnt using class-label as supervision. We also show that
egomotion based pretraining outperforms a previous approach based on
slow feature analysis for unsupervised learning from videos [?]. To the
best of our knowledge, this work provides the first effective
demonstration of learning visual representations from non-visual access
to egomotion information in a real world setting.

The rest of this paper is organized as follows: in section 2 we discuss
the related work, in sections 3, 4 and 5 we present the method, the
dataset details and the evaluation, and we conclude with the discussion
in section 6.

2. Related Work 

Past work in unsupervised learning has been dominated by approaches
that pose feature learning as the problem of discovering compact and
rich representations of images that are also sufficient to reconstruct
the images [3, 22, 32, ?, 30]. Another line of work has focused on
learning features that are invariant to transformations either from
video [?] or from images [?, 29]. [24] perform feature learning by
modeling spatial transformations using Boltzmann machines, but do not
evaluate the quality of learnt features.

Despite a lot of work in unsupervised learning (see [?] for a review),
a method that works on complex real world imagery is yet to be
developed. An alternative to unsupervised learning is to learn features
using intrinsic reward signals that are freely available to a system
(i.e. self-supervised learning). For instance, [?] used intrinsic
reward signals available to a robot for learning features that predict
path traversability, while [28] trained neural networks for driving
vehicles directly from visual input.

In this work we propose to use non-visual access to egomotion
information as a form of self-supervision for visual feature learning.
Unlike any other previous work, we show that our method works on real
world imagery. Closest to our method is the work on transforming
auto-encoders [?] that used egomotion to reconstruct the transformed
image from an input source image. This work was purely conceptual in
nature and the quality of learned features was not evaluated. In
contrast, our method uses egomotion as supervision by predicting the
transformation between two images using a siamese-like network model
[?].

Our method can also be seen as an instance of feature learning from
videos. [?] perform feature learning from videos by imposing the
constraint that temporally close frames should have similar feature
representations (i.e. slow feature analysis) without accounting for
either the camera motion or the motion of objects in the scene. In many
settings the camera motion dominates the motion content of the video.
Our key observation is that knowledge of camera motion (i.e. egomotion)
is freely available to mobile agents and can be used as a powerful
source of self-supervision.

3. A Simple Model of Motion-based Learning 

We model the visual system of the agent with a Convolutional Neural
Network (CNN, [20]). The agent optimizes its visual representations
(i.e. updates the weights of the CNN) by minimizing the error between
the egomotion information (i.e. camera transformation) obtained from
its motor system and the egomotion predicted using its visual inputs
only. Performing this task is equivalent to training a CNN with two
streams (i.e. Siamese Style CNN or SCNN [?]) that takes two images as
inputs and predicts the egomotion that the agent underwent as it moved
between the two spatial locations from which the two images were
obtained. In order to learn useful visual representations, the agent
continuously performs this task as it moves around in its environment.

In this work we use the pretraining-finetuning paradigm for evaluating
the utility of learnt features. Pretraining is the process of
optimizing the weights of a randomly initialized CNN for an auxiliary
task that is not the same as the target task. Finetuning is the process
of modifying the weights of a pretrained CNN for the given target task.
Our experiments compare the utility of features learnt using egomotion
based pretraining against class-label based and slow-feature based
pretraining on multiple target tasks.

Figure 1: Exploring the utility of egomotion as supervision for
learning useful visual features. A mobile agent equipped with visual
sensors receives a sequence of images as inputs while it moves in its
environment. The movement of the agent is equivalent to the movement of
a camera. In this work, egomotion based learning is posed as the
problem of predicting camera transformation from image pairs. The top
and bottom rows of the figure show some sample image pairs from the SF
and KITTI datasets that were used for feature learning.

3.1. Two Stream Architecture 

Each stream of the CNN independently computes features for one image.
Both streams share the same architecture and the same set of weights
and consequently perform the same set of operations for computing
features. The individual streams are called the Base-CNN (BCNN).
Features from the two BCNNs are concatenated and passed downstream into
another CNN called the Top-CNN (TCNN) (see figure 2). The TCNN is
responsible for using the BCNN features to predict the camera
transformation between the input pair of images. After pretraining, the
TCNN is removed and a single BCNN is used as a standard CNN for feature
computation for the target task.
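
The weight sharing between the two streams can be illustrated with a toy sketch. This is not the paper's implementation: dense layers stand in for convolutional ones, and the class name `TwoStreamNet`, the helper functions and the layer sizes are all hypothetical choices made for illustration.

```python
import random


def dense_relu(weights, x):
    """Apply a dense layer (list of weight rows) followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def make_weights(n_out, n_in, rng):
    """Small random weight matrix; the initialisation scheme is an assumption."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


class TwoStreamNet:
    """Toy stand-in for the SCNN: one shared base network plus one top network."""

    def __init__(self, n_in, n_feat, n_out, seed=0):
        rng = random.Random(seed)
        # A single set of base weights is used by both streams (BCNN).
        self.base = make_weights(n_feat, n_in, rng)
        # The top network (TCNN) sees the concatenated features of both streams.
        self.top = make_weights(n_out, 2 * n_feat, rng)

    def features(self, image):
        """Features of one image; this part is what remains after pretraining."""
        return dense_relu(self.base, image)

    def predict_transform(self, image_a, image_b):
        """Concatenate both streams and apply the output layer (no ReLU)."""
        joint = self.features(image_a) + self.features(image_b)
        return [sum(w * v for w, v in zip(row, joint)) for row in self.top]
```

Because `features` uses the same `base` weights for both inputs, discarding `top` after pretraining leaves a single-stream feature extractor, which mirrors the removal of the TCNN described above.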

3.2. Shorthand for CNN architectures 

The abbreviations Ck, Fk, P, D, Op represent a convolutional (C) layer
with k filters, a fully-connected (F) layer with k units, pooling (P),
dropout (D) and the output (Op) layers respectively. We used ReLU
non-linearity after every convolutional/fully-connected layer, except
for the output layer. The dropout layer was always used with dropout of
0.5. The output layer was a fully connected layer with the number of
units equal to the number of desired outputs. As an example of our
notation, C96-P-F500-D refers to a network with 96 filters in the
convolution layer followed by ReLU non-linearity, a pooling layer, a
fully-connected layer with 500 units, ReLU non-linearity and a dropout
layer. We used [?] for training all our models.

Figure 2: Description of the method for feature learning. Visual
features are learnt by training a Siamese style Convolutional Neural
Network (SCNN, [?]) that takes as inputs two images and predicts the
transformation between the images (i.e. egomotion). Each stream of the
SCNN (called the Base-CNN or BCNN) computes features for one image. The
outputs of the two BCNNs are concatenated and passed as inputs to a
second multilayer CNN called the Top-CNN (TCNN) (shown as layers F1,
F2). The two BCNNs have the same architecture and share weights. After
feature learning, the TCNN is discarded and a single BCNN stream is
used as a standard CNN for extracting features for performing target
tasks like scene recognition.
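
The shorthand of section 3.2 can be expanded mechanically into a list of layers. The sketch below is only illustrative: the function name `parse_architecture` and the tuple representation are hypothetical, and only the tokens Ck, Fk, P, D and Op defined in the text are handled.

```python
import re


def parse_architecture(spec):
    """Expand a shorthand string such as 'C96-P-F500-D' into (layer, value) tuples."""
    layers = []
    for token in spec.split("-"):
        match = re.fullmatch(r"([CF])(\d+)", token)
        if match:
            kind = "conv" if match.group(1) == "C" else "fc"
            layers.append((kind, int(match.group(2))))
            # ReLU follows every convolutional or fully-connected layer.
            layers.append(("relu", None))
        elif token == "P":
            layers.append(("pool", None))
        elif token == "D":
            # Dropout was always used with a rate of 0.5.
            layers.append(("dropout", 0.5))
        elif token == "Op":
            # The output layer is not followed by ReLU.
            layers.append(("output", None))
        else:
            raise ValueError(f"unknown token: {token!r}")
    return layers
```

For instance, `parse_architecture("C96-P-F500-D")` yields a convolution with 96 filters, ReLU, pooling, a fully-connected layer with 500 units, ReLU and dropout, matching the worked example in the text.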

3.3. Slow Feature Analysis (SFA) Baseline 

Slow Feature Analysis (SFA) is a method for feature learning based on
the principle that useful features change slowly in time. We used the
following contrastive loss formulation of SFA [?]:

$$
L(x_{t_1}, x_{t_2}, W) =
\begin{cases}
D(x_{t_1}, x_{t_2}) & \text{if } |t_1 - t_2| \le T \\
\max(0,\, m - D(x_{t_1}, x_{t_2})) & \text{if } |t_1 - t_2| > T
\end{cases}
\tag{1}
$$

where $L$ is the loss, $x_{t_1}, x_{t_2}$ refer to feature
representations of frames observed at times $t_1, t_2$ respectively,
$W$ are the parameters that specify the feature extraction process, $D$
is a measure of distance, $m$ is a predefined margin and $T$ is a
predefined time threshold for determining whether the two frames are
temporally close or not. In this work, $x_t$ are features computed
using a CNN with weights $W$ and $D$ was chosen to be the L2 distance.
SFA pretraining was performed using two stream architectures that took
pairs of images as inputs and produced $x_{t_1}, x_{t_2}$ as outputs
from the two streams respectively.
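
A minimal sketch of the contrastive SFA loss for a single pair of feature vectors follows. The function and argument names are hypothetical, features are plain Python lists, and the case boundaries (within the threshold versus strictly beyond it) follow the definition given above.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def sfa_contrastive_loss(x1, x2, t1, t2, margin, time_threshold):
    """Contrastive SFA loss for one pair of frames.

    Temporally close frames are pulled together (loss = distance);
    temporally distant frames are pushed apart until they are at least
    `margin` away from each other.
    """
    d = l2_distance(x1, x2)
    if abs(t1 - t2) <= time_threshold:
        return d
    return max(0.0, margin - d)
```

With a distance of 5 between the two feature vectors and a margin of 10, a temporally close pair incurs a loss of 5, and a distant pair also incurs 5 (the shortfall to the margin); once the distance exceeds the margin, a distant pair contributes nothing.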

3.4. Proof of Concept using MNIST 

On MNIST, egomotion was emulated by generating synthetic data
consisting of random transformations (translations and rotations) of
digit images. From the training set of 60K images, digits were randomly
sampled and then transformed using two different sets of random
transformations to generate image pairs. CNNs were trained for
predicting the transformations between these image pairs.

3.4.1 Data 

For egomotion based pretraining, relative translation between the
digits was constrained to be an integer value in the range [-3, 3] and
relative rotation θ was constrained to lie within the range
[-30°, 30°]. The prediction of transformation was posed as a
classification task with three separate soft-max losses (one each for
translation along the X and Y axes and the rotation about the Z-axis).
The SCNN was trained to minimize the sum of these three losses.
Translations along X, Y were separately binned into seven uniformly
spaced bins each. The rotations were binned into bins of size 3° each,
resulting in a total of 20 bins (or classes). For SFA based
pretraining, image pairs with relative translation in the range [-1, 1]
and relative rotation within [-3°, 3°] were considered to be temporally
close to each other (see equation 1). A total of 5 million image pairs
were used for both pretraining procedures.
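
The conversion of a relative transformation into the three classification targets can be sketched as follows. The function names are hypothetical, and the handling of a rotation of exactly +30° (assigned to the last bin) is an assumption, since the text does not say which bin the upper boundary belongs to.

```python
def translation_class(shift):
    """Map an integer translation in [-3, 3] to one of 7 classes (0..6)."""
    if not -3 <= shift <= 3:
        raise ValueError("translation must lie in [-3, 3]")
    return shift + 3


def rotation_class(angle_deg, bin_size=3.0, low=-30.0, high=30.0):
    """Map a rotation in [-30, 30] degrees to one of 20 bins of 3 degrees."""
    if not low <= angle_deg <= high:
        raise ValueError("rotation must lie in [-30, 30] degrees")
    n_bins = int((high - low) / bin_size)
    index = int((angle_deg - low) // bin_size)
    # Assumption: the upper boundary (+30 degrees) is placed in the last bin.
    return min(index, n_bins - 1)


def egomotion_targets(dx, dy, theta_deg):
    """Class labels for the three soft-max losses (X, Y translation, Z rotation)."""
    return translation_class(dx), translation_class(dy), rotation_class(theta_deg)
```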

3.4.2 Network Architectures 

We experimented with multiple BCNN architectures and chose the optimal
architecture for each pretraining method separately. For egomotion
based pretraining, the two BCNN streams were concatenated using the
TCNN: F1000-D-Op.


Table 1: Comparison of various pretraining methods on MNIST reveals
that egomotion based pretraining outperforms many previous approaches
for unsupervised learning. The performance is reported as the error
rate.

                        # examples for finetuning
Method                  100     300     1000    10000
Autoencoder [17]        24.1    12.2    7.7     4.8
Ranzato et al. [29]     -       7.18    3.21    0.85
Lee et al. [22]         -       -       2.62    -
Train from Scratch      20.1    8.3     4.5     1.6
SFA (m=10)              11.2    6.4     3.5     2.1
SFA (m=100)             11.9    6.4     4.8     4.7
Egomotion (ours)        8.7     3.6     2.0     0.9


Pretraining was performed for 40K iterations (i.e. 5M examples) using
an initial learning rate of 0.01 which was reduced by a factor of 2
after every 10K iterations.

The following architecture was used for finetuning: BCNN-F500-D-Op. In
order to evaluate the quality of BCNN features, the learning rate of
all layers in the BCNN was set to 0 during finetuning for digit
classification. Finetuning was performed for 4K iterations (which is
equivalent to training for 50 epochs for the 10K labelled training
examples) with a constant learning rate of 0.01.
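
The step schedule and the iteration counts above can be checked with a short sketch. The function name is hypothetical, and the batch size of 125 is not stated in the text; it is only inferred here from 40K iterations covering 5M examples.

```python
def step_learning_rate(iteration, base_lr=0.01, factor=2.0, step=10_000):
    """Learning rate that is divided by `factor` after every `step` iterations."""
    return base_lr / (factor ** (iteration // step))


# Arithmetic implied by the text (the batch size itself is an inference).
PRETRAIN_ITERS = 40_000
PRETRAIN_EXAMPLES = 5_000_000
implied_batch_size = PRETRAIN_EXAMPLES // PRETRAIN_ITERS

# 4K finetuning iterations at that batch size over 10K labelled examples.
finetune_epochs = 4_000 * implied_batch_size // 10_000
```

Under this inference the two statements in the text agree with each other: a batch size of 125 gives 5M examples in 40K iterations and 50 epochs over 10K examples in 4K iterations.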


3.4.3 Results 

The BCNN features were evaluated by computing the error rates on the
task of digit classification using 100, 300, 1K and 10K class-labelled
examples for training. These sets were constructed by randomly sampling
digits from the standard training set of 60K digits. For this part of
the experiment, the original digit images were used (i.e. without any
transformations or data augmentation). The standard test set of 10K
digits was used for evaluation and error rates averaged across 3 runs
are reported in table 1.

The BCNN architecture C96-P-C256-P was found to be optimal for
egomotion and SFA based pretraining and also for training from scratch
(i.e. random weight initialization). Results for other architectures
are provided in the supplementary material. For SFA based pretraining,
we experimented with multiple values of the margin m and found that
m = 10 and m = 100 led to the best performance. Our method outperforms
convolutional deep belief networks [22], a previous approach based on
learning features invariant to transformations and SFA based
pretraining.

4. Learning Visual Features From Egomotion in Natural Environments


We used two main sources of real world data for feature learning: the
KITTI and SF datasets, which were collected using cameras and odometry
sensors mounted on a car driving through urban scenes. Details about
the data, the experimental procedure, the network architectures and the
results are provided in sections 4.1, 4.2, 4.3 and 5 respectively.

4.1. KITTI Dataset 


The KITTI dataset provided odometry and image data recorded during 11
short trips of variable length made by a car moving through urban
landscapes. The total number of frames in the entire dataset was
23,201. Of the 11 sequences, 9 were used for training and 2 for
validation. The total number of images in the training set was 20,501.

The odometry data was used to compute the camera transformation between
pairs of images recorded from the car. The direction in which the
camera pointed was assumed to be the Z axis and the image plane was
taken to be the XY plane. The X-axis and Y-axis refer to the horizontal
and vertical directions in the image plane. As significant camera
transformations in the KITTI data were either due to translations along
the Z/X axis or rotation about the Y axis, only these three dimensions
were used to express the camera transformation. The rotation was
represented as the Euler angle about the Y-axis. The task of predicting
the transformation between pairs of images was posed as a
classification problem. The three dimensions of camera transformation
were individually binned into 20 uniformly spaced bins each. The
training image pairs were selected from frames that were at most ±7
frames apart to ensure that images in any given pair would have a
reasonable overlap. For SFA based pretraining, pairs of frames that
were separated by at most ±7 frames were considered to be temporally
close to each other.

The SCNN was trained to predict camera transformation from pairs of
227 × 227 pixel sized image regions extracted from images of overall
size 370 × 1226 pixels. For each image pair, the coordinates for
cropping image regions were randomly chosen. Figure 1 illustrates
typical image crops.
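
The pair selection and cropping described in this section can be sketched as below. Several details are assumptions rather than facts from the text: the two frames are required to be distinct, the frame gap is sampled uniformly among valid neighbours, and a single crop window is returned (whether both images of a pair share the same window is not specified). The function names are hypothetical.

```python
import random


def sample_pair(n_frames, max_gap=7, rng=random):
    """Pick two distinct frame indices from one sequence, at most max_gap apart.

    Requires n_frames >= 2 so that at least one neighbour exists.
    """
    first = rng.randrange(n_frames)
    candidates = [
        i for i in range(first - max_gap, first + max_gap + 1)
        if 0 <= i < n_frames and i != first
    ]
    return first, rng.choice(candidates)


def sample_crop(height=370, width=1226, size=227, rng=random):
    """Random size x size crop window as (top, left, bottom, right)."""
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return top, left, top + size, left + size
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when comparing pretraining runs.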

4.2. SF Dataset 


The SF dataset provides camera transformation between ≈ 136K pairs of
images (constructed from a set of 17,357 unique images). This dataset
was constructed using Google StreetView [?]. ≈ 130K image pairs were
used for training and ≈ 6K pairs for validation.

Just like KITTI, the task of predicting camera transformation was posed
as a classification problem. Unlike KITTI, significant camera
transformation was found along all six dimensions of transformation
(i.e. the 3 Euler angles and the 3 translations). Since it is
unreasonable to expect that visual features can be used to infer big
camera transformations, rotations between [-30°, 30°] were binned into
10 uniformly spaced bins and two extra bins were used for rotations
larger than 30° and smaller than -30° respectively. The three
translations were individually binned into 10 uniformly spaced bins
each. Images were resized to a size of 360 × 480 and image regions of
size 227 × 227 were used for training the SCNN.

Figure 3: Visualization of layer 1 filters learnt by egomotion based
pretraining on (a) KITTI and (b) SF datasets. A large majority of
layer-1 filters are color detectors and some of them are edge
detectors. This is expected as color is a useful cue for determining
correspondences between image pairs.
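
The rotation binning with two overflow bins can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, the ordering of the 12 classes (underflow first, overflow last) is an arbitrary choice, and a rotation of exactly +30° is placed in the last uniform bin.

```python
def sf_rotation_class(angle_deg, low=-30.0, high=30.0, n_bins=10):
    """Return a class index in 0..11 for a rotation angle in degrees.

    0        : rotations smaller than `low`
    1..10    : uniform bins covering [low, high]
    11       : rotations larger than `high`
    """
    if angle_deg < low:
        return 0
    if angle_deg > high:
        return n_bins + 1
    width = (high - low) / n_bins
    index = int((angle_deg - low) // width)
    # Assumption: the upper boundary belongs to the last uniform bin.
    return 1 + min(index, n_bins - 1)
```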

4.3. Network Architecture 

The BCNN closely followed the architecture of the first five AlexNet
layers [?]: C96-P-C256-P-C384-C384-C256-P. The TCNN architecture was:
C256-C128-F500-D-Op. The convolutional filters in the TCNN were of
spatial size 3 × 3. The networks were trained for 60K iterations with a
batch size of 128. The initial learning rate was set to 0.001 and was
reduced by a factor of two after every 20K iterations.

We term the networks pretrained using egomotion on the KITTI and SF
datasets KITTI-Net and SF-Net respectively. The net pretrained on KITTI
with SFA is called KITTI-SFA-Net. Figure 3 shows the layer-1 filters of
KITTI-Net and SF-Net. A large majority of layer-1 filters are color
detectors, while some of them are edge detectors. As color is a useful
cue for determining correspondences between nearby frames of a video
sequence, learning of color detectors as layer-1 filters is not
surprising. The fraction of filters that detect edges is higher for the
SF-Net. This is not surprising either, because a higher fraction of
images in the SF dataset contain structured objects like buildings and
cars.

5. Evaluating Motion-based Learning 

For evaluating the merits of the proposed approach, features learned
using egomotion based supervision were compared against features
learned using class-label and SFA based supervision on the challenging
tasks of scene recognition, intra-class keypoint matching, visual
odometry and object recognition. The ultimate goal of feature learning
is to find features that can generalize from only a few supervised
examples on a new task. Therefore it makes sense to evaluate the
quality of features when only a few labelled examples for the target
task are provided. Consequently, the scene and object recognition
experiments were performed in the setting when only 1-20 labelled
examples per class were available for finetuning.

The KITTI-Net and SF-Net (examples of models trained using egomotion
based supervision) were trained using only ≈ 20K unique images. To make
a fair comparison with class-label based supervision, a model with
AlexNet architecture was trained using only 20K images taken from the
training set of the ILSVRC12 challenge (i.e. 20 examples per class).
This model is referred to as AlexNet-20K. In addition, some experiments
presented in this work also make comparisons with AlexNet models
trained with 100K and 1M images, named AlexNet-100K and AlexNet-1M
respectively.

5.1. Scene Recognition 

The SUN dataset, consisting of 397 indoor/outdoor scene categories, was
used for evaluating scene recognition performance. This dataset
provides 10 standard splits of 5 and 20 training images per class and a
standard test set of 50 images per class. Due to the time required for
running all 10 splits of the experiment, we evaluated the performance
using only 3 train/test splits.

For evaluating the utility of CNN features produced by different
layers, separate linear (SoftMax) classifiers were trained on features
produced by individual CNN layers (i.e. BCNN layers of KITTI-Net,
KITTI-SFA-Net and SF-Net). Table 2 reports recognition accuracy
(averaged over 3 train/test splits) for various networks considered in
this study. KITTI-Net outperforms SF-Net and is comparable to
AlexNet-20K. This indicates that given a fixed budget of pretraining
images, egomotion based supervision learns features that are almost as
good as the features learnt using class-based supervision on the task
of scene recognition. The performance of features computed by layers
1-3 (abbreviated as L1, L2, L3 in table 2) of the KITTI-SFA-Net and
KITTI-Net is comparable, whereas layer 4, 5 features of KITTI-Net
significantly outperform layer 4, 5 features of KITTI-SFA-Net. This
indicates that egomotion based pretraining results in learning of
higher-level features, while SFA based pretraining results in learning
of lower-level features only.

The KITTI-Net outperforms GIST [26], which was specifically developed
for scene classification, but is outperformed by Dense SIFT with
spatial pyramid matching (SPM) kernel [?]. The KITTI-Net was trained
using limited visual data (≈ 20K frames) containing visual imagery of
limited diversity. The KITTI data mainly contains images of roads,
buildings, cars, few pedestrians, trees and some vegetation. It is in
fact surprising that a network trained on data with such little
diversity is competitive on classifying indoor and outdoor scenes with
the AlexNet-20K that was trained on a much more diverse set of images.
We believe that with more diverse training data for egomotion based
learning, the performance of learnt features will be better than
currently reported numbers.

The KITTI-Net outperformed the SF-Net except for the performance of
layer 1 (L1). As it was possible to extract a larger number of image
region pairs from the KITTI dataset as compared to the SF dataset (see
sections 4.1 and 4.2), the result that KITTI-Net outperforms SF-Net is
not surprising. Because KITTI-Net was found to be superior to the
SF-Net in this experiment, the KITTI-Net was used for all other
experiments described in this paper.


5.2. Object Recognition 

If egomotion based pretraining learns useful features for object
recognition, then a net initialized with KITTI-Net weights should
outperform a net initialized with random weights on the task of object
recognition. For testing this, we trained CNNs using 1, 5, 10 and 20
images per class from the ILSVRC-2012 challenge. As this dataset
contains 1000 classes, the total number of training examples available
for these networks was 1K, 5K, 10K and 20K respectively. All layers of
KITTI-Net, KITTI-SFA-Net and AlexNet-Scratch (i.e. CNN with random
weight initialization) were finetuned for image classification.

The results of the experiment presented in table 3 show that egomotion
based supervision (KITTI-Net) clearly outperforms SFA based supervision
(KITTI-SFA-Net) and AlexNet-Scratch. As expected, the improvement
offered by motion-based pretraining is larger when fewer examples are
provided for the target task. These results show that egomotion based
pretraining learns features useful for object recognition.


5.3. Intra-Class Keypoint Matching 

Identifying the same keypoint of an object across different instances
of the same object class is an important visual task. Visual features
learned using egomotion, SFA and class-label based supervision were
evaluated for this task using keypoint annotations on the PASCAL
dataset [?].

Keypoint matching was computed in the following way: First,
ground-truth object bounding boxes (GT-BBOX) from the PASCAL-VOC2012
dataset were extracted and resized (while preserving the aspect ratio)
to ensure that the smaller side of the boxes was of length 227 pixels.
Next, feature maps from layers 2-5 of various CNNs were computed for
every GT-BBOX. The keypoint matching score was computed between all
pairs of GT-BBOX belonging to the same object class. For a given pair
of GT-BBOX, the features associated with keypoints in the first image
were used to predict the location of the same keypoints in the second
image. The normalized pixel distance between the actual and predicted
keypoint locations was taken as the error in keypoint matching. More
details about this procedure are provided in the supplementary
materials.

Table 2: Comparing the accuracy of neural networks pre-trained using
motion-based and class-label based supervision for the task of scene
recognition on the SUN dataset. The performance of layers 1-6 (labelled
as L1-L6) of these networks was evaluated after finetuning the network
using 5/20 images per class from the SUN dataset. The performance of
the KITTI-Net (i.e. motion-based pretraining) compares favorably with a
network pretrained on Imagenet (i.e. class-based pretraining) with the
same number of pretraining images (i.e. 20K).

#Finetune = 5 images per class

Method          Pretrain Supervision  #Pretrain  L1    L2    L3    L4    L5    L6
AlexNet-1M      Class-Label           1M         5.3   10.5  12.1  12.5  18.0  23.6
AlexNet-20K     Class-Label           20K        4.9   6.3   6.6   6.3   6.6   6.7
KITTI-SFA-Net   Slowness              20.5K      4.5   5.7   6.2   3.4   0.5   -
SF-Net          Egomotion             18K        4.4   5.2   4.9   5.1   4.7   -
KITTI-Net       Egomotion             20.5K      4.3   6.0   5.9   5.8   6.4   -
GIST [38]       Human                 -          6.2 (single value, no per-layer breakdown)
SPM [38]        Human                 -          8.4 (single value, no per-layer breakdown)

#Finetune = 20 images per class

Method          Pretrain Supervision  #Pretrain  L1    L2    L3    L4    L5    L6
AlexNet-1M      Class-Label           1M         11.8  22.2  25.0  26.8  33.3  37.6
AlexNet-20K     Class-Label           20K        8.7   12.6  12.4  11.9  12.5  12.4
KITTI-SFA-Net   Slowness              20.5K      8.2   11.2  12.0  7.3   1.1   -
SF-Net          Egomotion             18K        8.6   11.6  10.9  10.4  9.1   -
KITTI-Net       Egomotion             20.5K      7.9   12.2  12.1  11.7  12.4  -
GIST [38]       Human                 -          11.6 (single value, no per-layer breakdown)
SPM [38]        Human                 -          16.0 (single value, no per-layer breakdown)

Table 3: Top-5 accuracy on the task of object recognition on the
ILSVRC-12 validation set. AlexNet-Scratch refers to a net with AlexNet
architecture initialized with random weights. The weights of KITTI-Net
and KITTI-SFA-Net were learned using egomotion based and SFA based
supervision on the KITTI dataset respectively. All the networks were
finetuned using 1, 5, 10, 20 examples per class. The KITTI-Net clearly
outperforms AlexNet-Scratch and KITTI-SFA-Net.

                            # examples per class
Method                      1      5      10     20
AlexNet-Scratch             1.1    3.1    5.9    14.1
KITTI-SFA-Net (Slowness)    1.5    3.9    6.1    14.9
KITTI-Net (Egomotion)       2.3    5.1    8.6    15.8

It is natural to expect that the accuracy of keypoint matching would
depend on the camera transformation between the two viewpoints of the
object (i.e. viewpoint distance). In order to make a holistic
evaluation of the utility of features learnt by different pretraining
methods on this task, matching error was computed as a function of
viewpoint distance [36]. Figure 4 reports the matching error averaged
across all keypoints, all pairs of GT-BBOX and all classes using
features extracted from layers conv-3 and conv-4.

KITTI-Net, trained with only 20K unique frames, was 
superior to AlexNet-20K and AlexNet-100K and inferior 
only to AlexNet-1M. A net with the AlexNet architecture 
initialized with random weights (AlexNet-Rand) surprisingly 
performed better than AlexNet-20K. One possible 
explanation for this observation is that with only 20K 
examples, the features learnt by AlexNet-20K capture only 
the coarse global appearance of objects and are therefore 
poor at keypoint matching. SIFT has been hand engineered 
for finding correspondences across images and performs as 
well as the best AlexNet-1M features for this task (i.e. 
conv-4 features). KITTI-Net also significantly outperforms 
KITTI-SFA-Net. These results indicate that features learnt 
by egomotion-based pretraining are superior to SFA and 
class-label based pretraining for the task of keypoint 
matching. 

Table 4: Comparing the accuracy of various pretraining 
methods on the task of visual odometry. 

                 Translation Acc.         Rotation Acc.
Method           δX     δY     δZ        δθ1    δθ2    δθ3
SF-Net           40.2   58.2   38.4      45.0   44.8   40.5
KITTI-Net        43.4   57.9   40.2      48.4   44.0   41.0
AlexNet-1M       41.8   58.0   39.0      46.0   44.5   40.5


5.4. Visual Odometry 


Visual odometry is the task of estimating the camera 
transformation between image pairs. All layers of KITTI-Net 
and AlexNet-1M were finetuned for 25K iterations using 
the training set of the SF dataset on the task of visual 
odometry (see section 4.2 for the task description). The 
performance of the various CNNs was evaluated on the 
validation set of the SF dataset and the results are reported 
in table 4. 

The performance of KITTI-Net was either superior or 
comparable to AlexNet-1M on this task. As the evaluation was 
made on the SF dataset itself, it was not surprising that on 
some metrics SF-Net outperformed KITTI-Net. The results 
of this experiment indicate that egomotion-based feature 
learning is superior to class-label based feature learning on 
the task of visual odometry. 
Figure 4: Intra-class keypoint matching error as a function of viewpoint distance averaged over 20 PASCAL objects using 
features from layers conv3 (left) and conv4 (right) of various CNNs used in this work. Please see the text for more details. 


6. Discussion 

In this work, we have shown that egomotion is a useful 
source of intrinsic supervision for visual feature learning in 
mobile agents. In contrast to class labels, knowledge of 
egomotion is "freely" available. On MNIST, egomotion-based 
feature learning outperforms many previous unsupervised 
methods of feature learning. Given the same budget of 
pretraining images, on the task of scene recognition, 
egomotion-based learning performs almost as well as 
class-label-based learning. Further, egomotion-based features 
outperform features learnt by a CNN trained using class-label 
supervision on two orders of magnitude more data 
(AlexNet-1M) on the task of visual odometry and one order 
of magnitude more data on the task of intra-class keypoint 
matching. In addition to demonstrating the utility of 
egomotion-based supervision, these results also suggest that 
features learnt by class-label based supervision are not 
optimal for all visual tasks. This means that future work 
should look at what kinds of pretraining are useful for what 
tasks. 

One potential criticism of our work is that we have 
trained and evaluated high capacity deep models on 
relatively little data (e.g. only 20K unique images available 
on the KITTI dataset). In theory, we could have learnt 
better features by downsizing the networks. For example, in 
our experiments with MNIST we found that pretraining a 
2-layer network instead of a 3-layer network results in 
better performance (table 1). In this work, we have made a 
conscious choice of using standard deep models because the 
main goal of this work was not to explore novel feature 
extraction architectures but to investigate the value of 
egomotion for learning visual representations on 
architectures known to perform well on practical 
applications. Future research focused on exploring 
architectures that are better suited for egomotion-based 
learning can only make a stronger case for this line of work. 
While egomotion is freely available to mobile agents, there 
are currently no publicly available egomotion datasets as 
large as ImageNet. Consequently, we were unable to 
evaluate the utility of motion-based supervision across the 
full spectrum of training set sizes. 

In this work, we chose to first pretrain our models using 
a base task (i.e. egomotion) and then finetune these models 
for target tasks. An equally interesting setting is that 
of online learning, where the agent has continuous access 
to intrinsic supervision (such as egomotion) and occasional 
explicit access to an extrinsic teacher signal (such as class 
labels). We believe that such a training procedure is likely 
to result in the learning of better features. Our intuition 
behind this is that seeing different views of the same 
instance of an object, say a car, may not be sufficient to 
learn that different instances of the car class should be 
grouped together. The occasional extrinsic signal about 
object labels may prove useful for the agent to learn such 
concepts. Also, the current work makes use of passively 
collected egomotion data and it would be interesting to 
investigate if it is possible to learn better visual 
representations if the agent can actively decide how to 
explore its environment (i.e. active learning). 
Appendix 

A. Keypoint Matching Score 

Consider images of two instances of the same object 
class (for example, the airplane images shown in the first 
row of figure 5) for which the keypoint matching score 
needs to be computed. 

The images are pre-processed in the following way: 
• Crop the groundtruth bounding box from the image. 

• Pad the images by 30 pixels along each dimension. 

• Resize each image so that the smallest side is 227 
pixels. The aspect ratio of the image is preserved. 
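
The size arithmetic of these pre-processing steps can be sketched as follows. This is only an illustration, not the code used in the experiments: the function name `preprocessed_size` is hypothetical, and the sketch assumes that "30 pixels along each dimension" means 30 pixels added on every side of the cropped box.

```python
def preprocessed_size(box_width, box_height, pad=30, target_short_side=227):
    """Return (width, height) of a cropped box after padding and resizing.

    Assumption (not stated in the text): padding adds `pad` pixels on
    every side of the crop, so each dimension grows by 2 * pad.
    """
    padded_w = box_width + 2 * pad
    padded_h = box_height + 2 * pad
    # Scale so that the shorter side becomes `target_short_side`,
    # keeping the aspect ratio fixed.
    short_side = min(padded_w, padded_h)
    new_w = round(padded_w * target_short_side / short_side)
    new_h = round(padded_h * target_short_side / short_side)
    return new_w, new_h


# A 100 x 200 bounding box becomes 160 x 260 after padding,
# and is then scaled so that the shorter side is 227 pixels.
size = preprocessed_size(100, 200)
```

The longer side is rounded to the nearest pixel; other rounding conventions would be equally consistent with the description.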

A.1. Keypoint Matching using CNN 

Assume that a given layer of the CNN is used for feature 
computation. The feature map produced by the layer is 
of dimensionality I x J x M, where (I, J) are the spatial 
dimensions and M is the number of filters in the layer. 
Thus, the layer produces an M-dimensional feature vector 
for each of the I x J grid positions in the feature map. 

The coordinates of the keypoints are provided in the 
image coordinate system. For the keypoints in the first 
image, we first determine their grid position in the I x J 
feature map. Each grid position has an associated receptive 
field in the image. Each keypoint is assigned to the grid 
position whose receptive field center is closest to the 
keypoint. This means that each keypoint is assigned 
one location in the feature map. 
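
As an illustration of this assignment step, the sketch below maps a keypoint to its nearest receptive field center. It is a minimal example under stated assumptions: the function names are hypothetical, receptive field centers are assumed to lie on a regular lattice (`offset + index * stride`), and the 13 x 13 grid and offset of 16 pixels are made-up values chosen only for the example.

```python
def receptive_field_centers(n_rows, n_cols, stride, offset):
    """Map each grid cell (i, j) to the (x, y) image coordinates of the
    center of its receptive field.

    Assumption: centers are evenly spaced, center = offset + index * stride.
    """
    return {
        (i, j): (offset + j * stride, offset + i * stride)
        for i in range(n_rows)
        for j in range(n_cols)
    }


def assign_keypoint_to_grid(keypoint_xy, centers):
    """Return the grid cell whose receptive field center is closest
    (in squared Euclidean distance) to the keypoint."""
    kx, ky = keypoint_xy
    return min(
        centers,
        key=lambda cell: (centers[cell][0] - kx) ** 2
        + (centers[cell][1] - ky) ** 2,
    )


# Hypothetical layer: 13 x 13 feature map, stride 16, offset 16.
centers = receptive_field_centers(n_rows=13, n_cols=13, stride=16, offset=16)
cell = assign_keypoint_to_grid((100.0, 45.0), centers)  # (row, column)
```

Because the assignment is a nearest-center lookup, every keypoint receives exactly one grid cell, as required by the matching procedure.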

Let the M-dimensional feature vector associated with the 
k-th keypoint in the first image be $F_1^k$. Let the 
M-dimensional feature vector at grid location $C_{ij}$ of the 
second image be $F_2(C_{ij})$. The location of the matching 
keypoint in the second image is determined by solving: 

$$\hat{C}_{ij} = \arg\min_{C_{ij}} \| F_1^k - F_2(C_{ij}) \|_2 \qquad (2)$$

$\hat{C}_{ij}$ is transformed into the image coordinate system 
by computing the center of the receptive field (in the image) 
associated with this grid position. Let these transformed 
coordinates be $\hat{C}_{ij}^{im}$ and the coordinates of the 
corresponding keypoint (in the second image) be $C_{gt}^k$. 
The matching error for the k-th keypoint ($E^k$) is defined as: 

$$E^k = \frac{\| \hat{C}_{ij}^{im} - C_{gt}^k \|_2}{L_2} \qquad (3)$$

where $L_2$ is the length of the diagonal (in pixels) of the 
second image. As different images have different sizes, 
dividing by $L_2$ normalizes for the difference in sizes. The 
matching error for a pair of images of instances belonging to 
the same class is calculated as: 

$$E_{instance} = \frac{\sum_{k=1}^{K} E^k}{K} \qquad (4)$$

where K is the number of keypoints matched for the pair. 
The average matching error across all pairs of instances of 
the same class is given by $E_{class}$: 

$$E_{class} = \frac{\sum_{instance} E_{instance}}{\#\text{pairs}} \qquad (5)$$

where #pairs is the number of pairs of object instances 
belonging to the same class. In Figure 4 of the main paper 
we report the matching error averaged across all the 20 
classes. 
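
The matching step and the normalized error described in this section can be sketched in a few lines. This is an illustrative example rather than the original implementation: the function names, the toy 2 x 2 feature map and the receptive field centers are all made up, and feature maps are represented as plain dictionaries from grid cell to feature vector.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_keypoint(query_feature, feature_map):
    """Grid cell of the second image whose feature vector is closest
    (in L2 distance) to the keypoint feature from the first image."""
    return min(feature_map, key=lambda cell: l2(query_feature, feature_map[cell]))


def matching_error(predicted_xy, ground_truth_xy, image_width, image_height):
    """Pixel distance between predicted and ground truth keypoint,
    divided by the length of the image diagonal."""
    return l2(predicted_xy, ground_truth_xy) / math.hypot(image_width, image_height)


def mean_error(errors):
    """Average of a non-empty list of errors (over keypoints or pairs)."""
    return sum(errors) / len(errors)


# Toy second-image feature map with M = 3 (hypothetical values).
feature_map = {
    (0, 0): [1.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 0): [0.0, 0.0, 1.0],
    (1, 1): [1.0, 1.0, 0.0],
}
# Hypothetical receptive field centers (x, y) of each grid cell.
centers = {(0, 0): (8, 8), (0, 1): (24, 8), (1, 0): (8, 24), (1, 1): (24, 24)}

best_cell = match_keypoint([0.9, 1.1, 0.0], feature_map)
error = matching_error(centers[best_cell], (27, 28), image_width=30, image_height=40)
```

Averaging `error` over the keypoints of a pair with `mean_error`, and then over pairs, follows the same pattern as the per-pair and per-class averages defined above.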


A.2. Keypoint Matching using SIFT 

SIFT features are extracted using a square window of 
size 72 pixels and a stride of 8 pixels using open source 
code. The stride of 8 pixels was chosen to have a 
fair comparison with the CNN features. The CNN features 
were computed with a stride of 8 for layer conv-2 and a 
stride of 16 for layers conv-3, conv-4 and conv-5. The 
matching error using SIFT was calculated in the same way 
as for the CNNs. 

A.3. Effect of Viewpoint on Keypoint Matching 

Intuitively, matching instances of the same object that are 
related by a large transformation (i.e. viewpoint distance) 
should be harder than matching instances with a small 
viewpoint distance. Therefore, in order to obtain a holistic 
understanding of the accuracy of features in performing 
keypoint matching, it is instructive to study the accuracy of 
matching as a function of viewpoint distance. 

The authors of [36] aligned instances of the same class 
(from PASCAL-VOC-2012) in a global coordinate system and 
provide a rotation matrix ($R$) for each instance in the 
class. To measure the viewpoint distance, we computed the 
Riemannian metric on the manifold of rotation matrices, 
$\| \log(R_i R_j^T) \|_F$, where log is the matrix logarithm, 
$\| \cdot \|_F$ is the Frobenius norm of the matrix and 
$R_i$, $R_j$ are the rotation matrices for the i-th and j-th 
instances respectively. We binned the distances 
into 10 uniform bins (of 18° each). In Figure 4 of the main 
paper we show the mean error in keypoint matching in each 
of these viewpoint bins. The matching error in the k-th bin is 
calculated by considering all the instances with a viewpoint 
distance < k × 18°, for k ∈ [1, 10]. As expected, we find that 
keypoint matching is worse for larger viewpoint distances. 
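
A small sketch of this viewpoint distance is given below. It relies on a standard identity rather than on an explicit matrix logarithm: for a relative rotation by angle θ (in radians), the Frobenius norm of its matrix logarithm equals √2·θ, and θ can be read off the trace. The function names are hypothetical, and the binning helper assumes non-overlapping 18° bins expressed in terms of θ, which is one reading of the description above.

```python
import math


def rotation_angle_deg(R_i, R_j):
    """Angle (in degrees) of the relative rotation R_i R_j^T."""
    # trace(R_i R_j^T) equals the sum of element-wise products.
    trace = sum(R_i[a][b] * R_j[a][b] for a in range(3) for b in range(3))
    # Clamp to [-1, 1] to guard against round-off before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def log_frobenius_norm(R_i, R_j):
    """||log(R_i R_j^T)||_F, using the identity sqrt(2) * theta."""
    return math.sqrt(2.0) * math.radians(rotation_angle_deg(R_i, R_j))


def viewpoint_bin(angle_deg, bin_width=18.0, n_bins=10):
    """Smallest k in [1, n_bins] with angle < k * bin_width
    (angles at the upper limit fall in the last bin)."""
    return min(int(angle_deg // bin_width) + 1, n_bins)


def rot_z(deg):
    """Rotation by `deg` degrees about the z axis (toy example input)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


angle = rotation_angle_deg(rot_z(70), rot_z(25))  # relative rotation of 45 degrees
k = viewpoint_bin(angle)
```

With 10 bins of 18° each, the bins cover relative rotations from 0° to 180°, which is the full range of the rotation angle.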

A.4. Matching Error for layers 2 and 5 
Figure 5: Example matchings between pairs of objects (randomly chosen) with viewpoints within 60 degrees of each other, 
for classes "aeroplane", "bottle", "dog", "person" and "tvmonitor" from PASCAL VOC. The matchings are shown 
for features from layer conv-4 of AlexNet-20K, AlexNet-100K, AlexNet-1M, KittiNet-20K and SIFT. The left image shows 
the ground truth keypoints that were matched with the keypoints in the right image. The right image shows the location of 
the ground truth keypoint (shown by a solid dot) and lines joining the predicted keypoint location (tip of the line) with the 
ground truth keypoint location. Please see the text for details of the keypoint matching procedure and figure 4 in the main paper for 
numerical results. This figure is best seen in color and with zoom. 
Figure 6: Intra-class keypoint matching error as a function of viewpoint distance averaged over 20 PASCAL objects using 
features extracted from layers pool-2 (left) and conv-5 (right) of various networks used in this work. Please see section 5.3 
for more details. 